Utah OAIP · Research team
April 22, 2026

Evidence infrastructure for the sandbox era.

A conversation with the Utah Office of AI Policy research team.

GLACIS April 22, 2026
§ I · Why we asked for this meeting

We listened to Zach at CHAI. Three things he said changed how we're thinking about the sandbox.

Context Dana Point · CHAI 2026 · last week
Authors Jennifer Goldsack · [Founder]
02 / 12 I · the opening
§ II · What Zach said — i
I know it when I see it. I can evaluate any individual case. I can't yet write the rules that would let someone else do what my team does.
— Zach Boyd, CHAI 2026

The sandbox is at the stage where every approval depends on staff judgment. The next phase is turning case review into standards — and that transition needs a shared evidence base, not just more reviewers.

03 / 12 II · case review
§ III · What Zach said — ii
‘Human in the loop’ is an allocation problem. When can a clinician responsibly delegate ninety to one hundred percent of a decision?
— Zach Boyd, CHAI 2026

The honest answer requires evidence of what the system actually did — not a vendor self-report, not a policy document, not a monitoring dashboard the vendor also owns.

04 / 12 III · oversight budget
§ IV · What Zach said — iii
Once standards are set, they ossify. The underlying technology keeps moving. The standard stops moving with it.
— Zach Boyd, CHAI 2026

Standards at the policy layer. Evidence at the runtime layer. One can evolve without re-certifying the other — which is how a sandbox stays current as the systems it governs change underneath.

05 / 12 IV · decoupling
§ V · The gap we see

Three questions your sandbox has to answer eventually — and none of them can be answered from vendor self-reports alone.

Q I

Was the declared control actually running at the moment of the decision?

Q II

Did the system's behavior drift from its baseline over time — and if so, when?

Q III

Can an independent auditor verify the first two offline, years later, without trusting us or the vendor?

06 / 12 V · the gap
§ VI · The mechanism

What we built — the mechanism, not the product.

Three components. Each one answers one of the questions on the previous slide. Each one is independently verifiable.

I · Attestation

A receipt for every inference call.

Every inference call is wrapped. A cryptographic receipt is emitted with the inputs, the declared policy, and the output — signed at the moment of execution.

Ed25519 · canonical JSON
No PHI egress
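The receipt idea above can be sketched in a few lines. This is an illustrative stand-in, not the shipping implementation: the slide names Ed25519, which the Python stdlib does not provide, so the sketch substitutes an HMAC-SHA256 tag over the same canonical-JSON payload; the key name and field names are hypothetical. Note that only digests of inputs and outputs are recorded, which is how the "no PHI egress" property holds.

```python
import hashlib
import hmac
import json
import time

# Hypothetical key for the sketch. The real mechanism signs with an
# Ed25519 keypair (via a library such as PyNaCl); HMAC-SHA256 is used
# here only to keep the example stdlib-runnable.
SIGNING_KEY = b"demo-key-not-for-production"

def canonical(obj: dict) -> bytes:
    # Canonical JSON: sorted keys, no whitespace, UTF-8 bytes.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def emit_receipt(inputs_digest: str, policy_id: str, output_digest: str) -> dict:
    """Wrap one inference call in a signed receipt.

    Only SHA-256 digests of the input and output are recorded,
    never the raw content, so no PHI leaves the deployment.
    """
    body = {
        "ts": time.time(),
        "inputs_sha256": inputs_digest,
        "policy": policy_id,
        "output_sha256": output_digest,
    }
    sig = hmac.new(SIGNING_KEY, canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}

def verify_receipt(receipt: dict) -> bool:
    expected = hmac.new(SIGNING_KEY, canonical(receipt["body"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["sig"])

receipt = emit_receipt(
    hashlib.sha256(b"prompt").hexdigest(),
    "policy/triage-v1",
    hashlib.sha256(b"completion").hexdigest(),
)
```

Any change to the body after signing, even one field, invalidates the signature, which is what makes the receipt evidence rather than a log line.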
II · Drift detection

Continuous statistical monitoring.

A behavioral baseline is established and watched. Distributional shift is caught as it happens — not weeks later in a retrospective review.

Page-CUSUM · auto-tuning
Streaming, per‑deployment
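The core of Page's CUSUM test, which the slide names, fits in a few lines. This sketch detects an upward shift in the mean of a scalar metric stream; the slack and threshold values are illustrative constants, whereas the slide describes them as auto-tuned per deployment.

```python
def page_cusum(stream, mu0, k=0.5, h=5.0):
    """One-sided Page CUSUM: flag an upward shift in the mean.

    mu0: baseline mean from the Discover phase.
    k:   slack, roughly half the shift size worth detecting.
    h:   decision threshold (illustrative; tuned in practice).
    Returns the index of the first alarm, or None.
    """
    g = 0.0
    for i, x in enumerate(stream):
        # Accumulate evidence of an upward shift; reset at zero so
        # in-baseline noise never builds toward a false alarm.
        g = max(0.0, g + (x - mu0 - k))
        if g > h:
            return i
    return None

# Ten baseline samples, then the mean jumps from 0 to 2:
alarm = page_cusum([0.0] * 10 + [2.0] * 10, mu0=0.0)  # fires at index 13
```

Because the statistic updates per sample, the alarm fires within a handful of observations of the shift, which is the "as it happens, not weeks later" property the slide claims.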
III · Verification

Offline, independent replay.

A Python verifier reads the receipt stream and confirms the chain. Your auditor verifies our evidence without trusting us, our servers, or our certificates.

407 lines · stdlib only
Zero network
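The offline-replay idea can be sketched as a hash-chain check over the receipt stream. The schema below (a `body` plus a `chain` digest linking to the previous receipt) is a hypothetical simplification of the real record format; the point is that verification needs only the stream itself, the stdlib, and zero network.

```python
import hashlib
import json

def _canonical(body: dict) -> bytes:
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def chain_digest(prev_hex: str, body: dict) -> str:
    # Each receipt's digest covers its body AND its predecessor's
    # digest, so deleting or editing any record breaks the chain.
    return hashlib.sha256(prev_hex.encode() + _canonical(body)).hexdigest()

def verify_chain(lines):
    """Replay a newline-delimited receipt stream offline.

    Recomputes every chain digest from the genesis value and compares.
    Returns (True, None) on success or (False, first_bad_line_number).
    """
    prev = "0" * 64  # genesis value, by convention
    for n, line in enumerate(lines, 1):
        rec = json.loads(line)
        if rec["chain"] != chain_digest(prev, rec["body"]):
            return False, n
        prev = rec["chain"]
    return True, None

# Build a tiny two-receipt stream, verify it, then corrupt receipt 2.
prev = "0" * 64
stream = []
for body in ({"call": 1, "policy": "p1"}, {"call": 2, "policy": "p1"}):
    prev = chain_digest(prev, body)
    stream.append(json.dumps({"body": body, "chain": prev}))

ok, bad = verify_chain(stream)             # (True, None)
stream[1] = stream[1].replace('"p1"', '"p2"', 1)
ok2, bad2 = verify_chain(stream)           # (False, 2)
```

Nothing in the verifier trusts the emitter, its servers, or its certificates; an auditor can run this against an archived stream years later, which is the question from slide V.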
07 / 12 VI · the mechanism
§ VII · Why this maps to your sandbox

From narrative case files to a queryable evidence base.

Every sandbox participant generates structured, comparable evidence against the same schema.

Your office moves from staff-judgment approvals toward generalizable rules, at the pace your team decides — not at the pace of the next hiring cycle.

Participants carry their attestation record with them when they deploy downstream.

OAIP sandbox → Intermountain → U of U Health → Mountain states
08 / 12 VII · downstream
§ VIII · What we've learned from design partners

Skin in the game.

Not a credentials slide. Where our evidence comes from, in one breath.

CHAI Partnership track on evaluation infrastructure, ongoing since 2025.
OVERT 1.0 Open evidence standard, 170-page spec, published and referenceable.
Insurability White paper on actuarial readiness for AI systems — Jennifer, published this week.
Autoredteam Open-source release, announced at the Allen Institute keynote.
Practitioners Jen's clinical practice, plus the mental-health-chatbot origin story that brought us here.
09 / 12 VIII · provenance
§ IX · What we'd like to propose

A structured pilot, not a procurement.

— i

Instrument one existing sandbox participant with our evidence layer.

— ii

Zero cost. Zero PHI egress. Zero procurement process.

— iii

Monthly evidence report your team can independently verify — and in return, we learn what format is most useful to your review process.

10 / 12 IX · the ask
§ X · What we'd like to show you now

A five-stage walkthrough of how this works end to end.

One sentence per stage. The demo itself follows.

I

Declare

The participant states the policy the system is supposed to follow, in machine-readable form.

II

Discover

We establish the behavioral baseline from real, consented traffic before anything goes live.

III

Defend

Every inference call in production is wrapped and the declared control is enforced at runtime.

IV

Detect

Drift from the baseline is flagged as it happens, with the evidence attached.

V

Prove

Your auditor re-runs the verifier offline and confirms the record, months or years later.

Declare → Discover → Defend → Detect → Prove
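The Declare stage above asks for the policy in machine-readable form. As a sketch only: the field names below are hypothetical, not the OVERT 1.0 schema, but they show the shape of a declaration that later receipts can reference by a stable identifier.

```python
import json

# Hypothetical Declare-stage payload; every field name here is
# illustrative, standing in for whatever the real spec defines.
declaration = {
    "policy_id": "policy/triage-v1",
    "declared_controls": [
        {"name": "phi_redaction", "enforced": True},
        {"name": "human_review", "when_confidence_below": 0.80},
    ],
    "baseline": {"source": "consented_traffic", "window_days": 14},
}

# Canonical serialization, so the same declaration always hashes the
# same way when runtime receipts point back at it.
wire = json.dumps(declaration, sort_keys=True, separators=(",", ":"))
```

Because the declaration is data rather than a policy document, the Defend stage can enforce it mechanically and the Prove stage can show exactly which version was in force at each call.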
11 / 12 X · the demo
§ XI · Closing

The sandbox participants you evaluate today will deploy across the mountain states tomorrow. The evidence record you require is what makes that deployment trustworthy.

—  Two questions, and then you talk
What would the ideal evidence format look like from your side?
Who else from your office should see this?
12 / 12 XI · thank you